Save and parse Arkane species with all statmech properties using YAML format #1402

alongd · 2018-06-13T15:34:19Z

We'd like to be able to conveniently use previously calculated statmech data in Arkane (prev. Cantherm). The purpose is to have an online repository of such files, from which users can pick the desired species at the appropriate level of theory (if available). This PR sets the ground by allowing Arkane to save and parse such files.

codecov · 2018-06-13T16:19:41Z

Codecov Report

Merging #1402 into master will decrease coverage by 0.04%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #1402      +/-   ##
==========================================
- Coverage   42.07%   42.02%   -0.05%     
==========================================
  Files         165      165              
  Lines       27824    27867      +43     
  Branches     5668     5680      +12     
==========================================
+ Hits        11706    11711       +5     
- Misses      15332    15371      +39     
+ Partials      786      785       -1

Impacted Files	Coverage Δ
rmgpy/quantity.py	`0% <0%> (ø)`	⬆️
rmgpy/data/kinetics/family.py	`58.58% <0%> (+0.29%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c61d2ef...4c77d09. Read the comment docs.

mliu49 · 2018-06-13T20:12:40Z

If this is going to be a brand new file type, I strongly suggest using an established format such as yaml, which you can parse using pyyaml to give a python dict. Using exec on user provided input files is a security risk since we could be executing arbitrary code. It would be best to convert all of our input files and database files to not require exec, but it'll take time.

Regarding the idea of having a database of statmech info, it would again be best if it's not a plain-text database like RMG-database is currently. Kehang's thermo central database uses mongoDB, which is an option, especially since it seems like it would overlap or be a part of this future statmech database. There are also tons of other options, but we should definitely put some thought into the database design.

alongd · 2018-06-15T16:20:20Z

I changed the format into .yml.
Still, we need to use eval() (instead of execute()), because loading the YML file only gives dictionaries of strings, and the objects must be reconstructed.
I suggest we add more thorough checks to the constructor functions I added to common.py that the string we're evaluating is indeed what we think it is (contains several key words, under a certain maximum length, etc.). Other suggestions are welcomed!

mliu49

I really like where this is going! I made a few comments, feel free to push back on any if you have other thoughts.

mliu49 · 2018-06-18T19:24:07Z

rmgpy/cantherm/common.py

+from rmgpy.molecule.translator import toInChI, toInChIKey
+from rmgpy.thermo import NASA, NASAPolynomial, Wilhoit, ThermoData
+from rmgpy.statmech.conformer import Conformer
+from rmgpy.statmech import IdealGasTranslation, LinearRotor, NonlinearRotor, HarmonicOscillator


I think some of these imports are unused, so you could probably remove them.

Also, imports should be grouped by standard imports, third-party libraries, then local imports (PEP 8). It's not that important, but something to consider. For example:

import logging import os.path import re import time import numpy import yaml from yaml import ... import rmgpy.constants as constants from rmgpy import ...

Thanks for informing me of the structure, I'll adhere to it.
The imports here are being used implicitly by the eval() function which creates objects from these classes if they are present in the .yml file, so I think we should leave them (otherwise it errors). Since advanced text editors mark them as unused, I think it's a good idea to add a comment next to them explaining why they are kept. I'll happily reorganize imports across Cantherm.

mliu49 · 2018-06-18T19:51:33Z

rmgpy/cantherm/common.py

+                        ' kJ/mol. This can cause significant errors in your computed rate constants. ')
+
+
+def determine_molecular_weight(species):


Instead of creating this method, you should modify the getter for Species.molecularWeight.

First, it would be more convenient for me if you could update the syntax to match my PR (this commit). Then, you can change the get method:

@property def molecularWeight(self): if self._molecularWeight is None: self._molecularWeight = quantity.Mass(self.molecule[0].getMolecularWeight(), 'kg/mol') return self._molecularWeight

Once you do that, you can access the Species.molecularWeight property regardless of whether or not it's been set.

Thanks for the comment!
I'll implement it in the Species __init__():

def __init__(self, ..., molecularWeight=None, ...): if self.molecularWeight is None and self.molecule is not None and len(self.molecule) > 0: self._molecularWeight = quantity.Mass(self.molecule[0].getMolecularWeight(), 'kg/mol') else: self.molecularWeight = molecularWeight

Shall I still change the syntax of molecularWeight? I can leave it untouched so #1329 will have less conflicts

On second though, your recommendation seems better if for some reason an empty Species object is first created, then a structure is assigned.

mliu49 · 2018-06-18T19:54:26Z

rmgpy/cantherm/common.py

+            d['thermo'] = species.thermo
+        d['chemkin_thermo_string'] = chemkin_thermo_string
+
+        self.species_dictionary = d


I don't think this should be called species_dictionary. I also think that you could save all of these parameters as normal attributes, then create a method as_dictionary if you need them as a dictionary.

If you do that, I think you also wouldn't need the update method.

I agree.
I think I still need something like the update() method, though, to only append the species attributes to this object (the object is initiated with non-species-specific attributes such as author and level_of_theory which are kept)

mliu49 · 2018-06-18T19:55:47Z

rmgpy/cantherm/data/HSOO.yml

+SMILES: '[O]OS'
+adjacencyList: 'multiplicity 2
+
+  1 S u0 p2 c0 {2,S} {4,S}


Do you know why there are extra blank lines in the adjlist?

I don't know why, it bugs me as well. But I verified that the adjList is being read correctly.
If I simply print it during the run (either using print or logging.info) it looks correct with no extra line brakes.

mliu49 · 2018-06-18T19:58:09Z

rmgpy/cantherm/data/HSOO.yml

+transport_data: TransportData(epsilon=(3625.11,'J/mol'), sigma=(5.7,'angstrom'))
+use_atom_corrections: true
+use_bond_corrections: false
+use_hindered_rotors: true


Some of these parameters don't seem like they would be important to store, e.g. use_atom_corrections, use_bond_corrections, use_hindered_rotors.

At least use_bond_corrections is crucial, the rest are just "nice to have".

use_bond_corrections: This might impact the decision of the user of whether or not to use this file. If the user wants to use this species in a PES with other species calculated at different levels of theory, it is important/recommended to use bond corrections. On the other hand, if all species were calculated at the same level, it is better practice not to use bond corrections (which don't capture all bond types, and are often ambiguous for TSs).

use_hindered_rotors: This could also be directly inferred from the Conformer object. However, for a quick assessment, it's nice to have it explicitly written.

use_atom_corrections: Can be inferred from the atom_energies attribute, I'm OK with removing it since using this feature is less common.

mliu49 · 2018-06-18T20:09:58Z

rmgpy/cantherm/input.py

        spec.conformer = Conformer(E0=E0, modes=modes, spinMultiplicity=spinMultiplicity, opticalIsomers=opticalIsomers)
+        if molecularWeight is None:
+            if is_pdep():


Does this work if the species block comes before the pdep block?

Also, if you refactor the molecular weight implementation based on my other comment, would that eliminate the need for this if block?

mliu49 · 2018-06-18T20:16:17Z

rmgpy/cantherm/input.py

+    """
+    Load the species with all statMech data from the .yml file
+    """
+    global jobList


It doesn't seem like you use jobList in this function.

mliu49 · 2018-06-18T20:59:05Z

rmgpy/cantherm/input.py

+                    coordinates = np.ndarray(shape=(len(number), 3))
+                    for i, coorlist in enumerate(job.species.conformer.coordinates.value_si):
+                        coordinates[i] = [coor for coor in coorlist]
+                    mol = Molecule().fromXYZ(atomicNums=number, coordinates=coordinates)


This section seems oddly coded. Was it done intentionally? It seems like the following would be sufficient:

number = job.species.conformer.number.value_si.tolist() coordinates = job.species.conformer.coordinates.value_si.tolist()

Yes, the issue here is that we must pass a np.ndarray type to Molecule().fromXYZ()
The data generated from .tolist() throws an error.
I'll be happy to consider a different alternative to the non straight forward code I used, though.

Oh, I misinterpreted it, I thought initially that the conformer attributes were np.ndarray objects, and you were converting them into lists.

If the conformer attributes are np.ndarray, it seems you could use them directly. If they're lists, then couldn't you use np.array(list_input) to convert it to an np.ndarray?

mliu49 · 2018-06-18T21:01:11Z

rmgpy/cantherm/statmech.py

+        filename, file_extension = os.path.splitext(self.path)
+        if 'yml' in file_extension or 'yaml' in file_extension:
+                self.load_species_from_database = True
+                return


It seems like you could just call load_species_from_database_file here instead of main.py, and it would also eliminate the need for the load_species_from_database attribute.

I agree, will change accordingly

mliu49 · 2018-06-18T21:02:50Z

rmgpy/cantherm/thermo.py

@@ -188,6 +188,30 @@ def save(self, outputFile):
                f.write('\n')
        f.close()

+        return chem_string
+
+    def save_species_to_database_file(self, path):


I think this may be more appropriate as a method of CanthermSpecies. Let me know what you think.

alongd · 2018-06-22T16:01:52Z

@mliu49, Thanks for all the valuable comments!
I addressed them in fixup commits. At some point I didn't bother correctly naming the fixup commits due to the relocation of methods between files, I think I'll re-commit it all instead of squashing.
Let me know if you have any additional comments.

alongd · 2018-06-26T12:30:45Z

@mliu49, can I squash this branch?

mliu49 · 2018-06-26T22:14:46Z

Yeah, you can go ahead and squash. I guess this might get replaced by the new method though, if that turns out ok.

mliu49 · 2018-06-27T01:23:59Z

rmgpy/rmg_obj.py

+###############################################################################
+import numpy as np
+
+class RMG_Obj(object):


A couple comments on naming:

Class names should be CamelCase, though I think we can make an exception for RMG. I would probably call this RMGObject.

Module names are ideally one word. Perhaps something like object, base, or common?

Variable names should be lowercase. I think D should be renamed to something else, like output or even output_dict.

Also, could you end the file with a newline?

mliu49 · 2018-06-27T01:25:43Z

I don't think we discussed how to handle Quantity objects with the new dict approach - we would need some way to save and load the units.

mjohnson541 · 2018-06-27T01:28:16Z

They're our own objects right? So can't we just make them inherit from RMGObject?

alongd · 2018-06-27T13:51:26Z

rmgpy/rmgobject.py

+    def __init__(self):
+        pass
+    def as_dict(self):
+        D = self.__dict__.copy()


Thanks for this addition, @mjohnson541 !
I think that self.__dict__ contains methods as well, and we should first check if not callable(value) before adding entries to the output dictionary.

alongd · 2018-07-05T18:08:52Z

I rewrote this branch using the as_dict() and make_object() methods we discussed for YAML loading and parsing.
Thanks @mjohnson541 and @mliu49 for the discussion and help with writing the methods.
The branch is now ready for another review cycle.

I'll note that since cythonized classes cannot inherit from non-cythonized classes, and vice versa, I had to make two parent classes: RMGObject which is cythonized, and RMGPYObject which isn't.
Quantity (Scalar & Array) and the thermo classes (NASA & Wilhoit) were not set to inherit from the parent objects. I think it's OK for quantity, but we could theoretically modify the way attributes are saved in the thermo classes so that they too can inherit from RMGObject. But I didn't think it was that important when I implemented it.

mliu49 · 2018-07-06T17:12:52Z

Are you sure that Python classes can't subclass Cython classes? It should be possible according to the Cython documentation (section 3.5).

alongd · 2019-01-08T16:38:29Z

rebased. @mliu49, let me know when I can squash.

alongd · 2019-01-21T14:50:04Z

@mliu49, let me know when to squash the fixups to finalize the PR

mliu49

I made a few more comments. You can go ahead and squash everything else.

mliu49 · 2019-01-21T17:15:41Z

arkane/input.py

-                                 " cannot be reconstructed.".format(spec.label))
+                if spec.molecularWeight is None and is_pdep(jobList):
+                    # If one of the jobs is pdep and no molecular weight is given or calculated, raise an error
+                    raise ValueError("No molecularWeight was entered for species {0}. Since a structure wasn't given"


Hmm, my suggestion might not have been correct, since this now misses the case where a structure is not provided. One possibility:

if molecularWeight is not None: spec.molecularWeight = molecularWeight elif spec.molecularWeight is None and is_pdep(jobList): raise ValueError()

There shouldn't be any need to check whether structure exists, because the the Species.molecularWeight property checks if it has a molecule attribute, which will be empty if structure was empty.

mliu49 · 2019-01-21T17:19:20Z

rmgpy/rmgobject.pyx

@@ -1,10 +1,11 @@
-# cython: embedsignature=True, cdivision=True
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-


This actually isn't necessary either, since this is technically not a python file.

Did you mean the .pyx or the .pxd file?

mliu49 · 2019-01-21T17:22:35Z

arkane/qchem.py

@@ -273,12 +273,13 @@ def loadEnergy(self, frequencyScaleFactor=1.):
        """
        E0 = None
        with open(self.path, 'r') as f:
-            for line in f:
+            lines = f.readlines()
+            for line in lines:


The original implementation is preferable because it treats the file as an iterator, so only one line is loaded at a time. If you use readlines(), then the entire file has to be loaded into memory.

OK, I think this isn't related to this PR (sorry) so I removed it, we could discuss it at another time (I think it might be necessary for running Arkane through ARC). Thanks for bringing this up, I removed the commit.

For ScalarQuantity and ArrayQuantity

Classes modified: Conformer, SingleExponentialDown, Mode, TransportData

simply by providing the structure

Along with its .yml file

- Added YAML description for the species() block - Added levelOfTheory and author descriptions

alongd · 2019-01-22T03:28:08Z

OK, done and squashed

mliu49

Thanks!

alongd added Type: Feature Arkane Type: User Experience labels Jun 13, 2018

alongd self-assigned this Jun 13, 2018

alongd requested a review from mjohnson541 June 13, 2018 15:34

alongd force-pushed the read_spc branch 3 times, most recently from 39614f3 to 7ac6569 Compare June 13, 2018 18:59

alongd force-pushed the read_spc branch from 7ac6569 to f6bf1c0 Compare June 15, 2018 16:15

alongd requested a review from mliu49 June 15, 2018 16:20

alongd force-pushed the read_spc branch from 36d02e0 to 43fe27b Compare June 18, 2018 13:12

mliu49 requested changes Jun 18, 2018

View reviewed changes

alongd force-pushed the read_spc branch from 43fe27b to 811316f Compare June 22, 2018 14:57

mliu49 reviewed Jun 27, 2018

View reviewed changes

mjohnson541 force-pushed the read_spc branch from 09a1450 to 1c4612a Compare June 27, 2018 01:27

alongd commented Jun 27, 2018

View reviewed changes

alongd force-pushed the read_spc branch 2 times, most recently from 322feb1 to 4bfbbb8 Compare July 5, 2018 18:01

alongd force-pushed the read_spc branch 2 times, most recently from b132315 to ad4e8c9 Compare July 6, 2018 14:33

alongd force-pushed the read_spc branch from 2d0e7d3 to 26ca108 Compare January 8, 2019 16:37

alongd force-pushed the read_spc branch from 26ca108 to 1092f27 Compare January 21, 2019 14:48

mliu49 reviewed Jan 21, 2019

View reviewed changes

alongd added 3 commits January 21, 2019 21:26

Changed reserved word string in Arkane

02f75f1

Correct numpy import and usage in arkane.thermo

527e96d

Added pyYAML to environment .yml files

de322dd

alongd force-pushed the read_spc branch from 1092f27 to 243ef1c Compare January 22, 2019 02:26

alongd added 13 commits January 21, 2019 22:27

Added the RMGObject parent class

f6469f7

Added as_dict() and make_object() methods to quantity

0e3c2c1

For ScalarQuantity and ArrayQuantity

Added the RMGObject Extension to setup.py

d1955e1

Set different classes to inherit from RMGObject

158bd19

Classes modified: Conformer, SingleExponentialDown, Mode, TransportData

Added as_dict() and make_object() methods to NASA and Wilhoit

8733751

Set Graph to inherit from object

c0b4e98

Dumpt statmech self.quantum into YAML files as well

7f4fd4f

Added the ArkaneSpecies class

7ffbd69

Dump and parse YAML files in Arkane

d608349

Added TestArkaneSpecies

e8f9cce

Modified the C2H4 and C2H6 examples to generate .yml files

ed76ec1

simply by providing the structure

Added the C2H6_from_yaml Arkane example

dd323d0

Along with its .yml file

Modifications to Arkane's documentation

4c77d09

- Added YAML description for the species() block - Added levelOfTheory and author descriptions

alongd force-pushed the read_spc branch from 243ef1c to 4c77d09 Compare January 22, 2019 03:27

alongd removed the request for review from mjohnson541 January 22, 2019 03:27

mliu49 approved these changes Jan 22, 2019

View reviewed changes

alongd merged commit 3a7b901 into master Jan 22, 2019

alongd deleted the read_spc branch January 22, 2019 15:42

mliu49 mentioned this pull request Feb 28, 2019

Load and dump TS YAML files in Arkane #1551

Merged

mliu49 mentioned this pull request May 15, 2019

RMG v2.4.0 (?) Release Planning #1591

Closed

5 tasks

		' kJ/mol. This can cause significant errors in your computed rate constants. ')


		def determine_molecular_weight(species):

Save and parse Arkane species with all statmech properties using YAML format #1402

Save and parse Arkane species with all statmech properties using YAML format #1402

Conversation

alongd commented Jun 13, 2018 • edited Loading

codecov bot commented Jun 13, 2018 • edited Loading

Codecov Report

mliu49 commented Jun 13, 2018

alongd commented Jun 15, 2018

mliu49 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alongd Jun 21, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alongd commented Jun 22, 2018

alongd commented Jun 26, 2018

mliu49 commented Jun 26, 2018 • edited Loading

Choose a reason for hiding this comment

mliu49 commented Jun 27, 2018

mjohnson541 commented Jun 27, 2018

Choose a reason for hiding this comment

alongd commented Jul 5, 2018 • edited Loading

mliu49 commented Jul 6, 2018

alongd commented Jan 8, 2019

alongd commented Jan 21, 2019

mliu49 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alongd commented Jan 22, 2019

mliu49 left a comment

Choose a reason for hiding this comment

alongd commented Jun 13, 2018 •

edited

Loading

codecov bot commented Jun 13, 2018 •

edited

Loading

alongd Jun 21, 2018 •

edited

Loading

mliu49 commented Jun 26, 2018 •

edited

Loading

alongd commented Jul 5, 2018 •

edited

Loading